KMID : 0605620200270010018
Journal of Korean Society of Biological Psychiatry
2020 Volume.27 No. 1 p.18 ~ p.26
Application of Text-Classification Based Machine Learning in Predicting Psychiatric Diagnosis
Pak Doo-Hyun

Hwang Min-Gyu
Lee Min-Ji
Woo Sung-Il
Hahn Sang-Woo
Lee Yeon-Jung
Hwang Jae-Uk
Abstract
Objectives : The aim was to find effective vectorization and classification models to predict a psychiatric diagnosis from text-based medical records.

Methods : Electronic medical records (n = 494) of the present illness were collected retrospectively from inpatient admission notes with one of three diagnoses: major depressive disorder, type 1 bipolar disorder, or schizophrenia. The data were split into 400 training records and 94 independent validation records. The texts were vectorized using two models, term frequency-inverse document frequency (TF-IDF) and Doc2vec. Machine learning classifiers, including stochastic gradient descent, logistic regression, support vector classification, and deep learning (DL), were applied to predict the three psychiatric diagnoses. Five-fold cross-validation was used to find an effective model. Accuracy, precision, recall, and F1-score were measured to compare the models.
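
As an illustration of two steps named in the Methods, the following minimal Python sketch computes TF-IDF weights and an F1-score using only the standard library. It is not the authors' implementation: the function names (tfidf_vectors, macro_f1), the whitespace tokenizer, the unsmoothed log(N/df) form of the inverse document frequency, the macro averaging of F1 across the diagnosis classes, and the toy input strings are all assumptions made for this example. The abstract does not state which TF-IDF variant or averaging scheme was used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors), where each vector is a {term: weight} dict.

    Assumed variant: tf = term count / document length, idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return sorted(df), vectors


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores (macro averaging, assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy inputs, not taken from the study's records.
toy_docs = ["low mood insomnia", "elevated mood insomnia", "auditory hallucinations"]
vocab, vectors = tfidf_vectors(toy_docs)
score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

In this variant, a term that occurs in every document receives an idf of zero and therefore carries no weight, which is the intended effect of the inverse document frequency.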

Results : Five-fold cross-validation on the training data showed that the DL model with Doc2vec was the most effective at predicting the diagnosis (accuracy = 0.87, F1-score = 0.87). However, these metrics decreased on the independent test data set with the final working DL models (accuracy = 0.79, F1-score = 0.79), while logistic regression and the support vector machine with Doc2vec performed slightly better (accuracy = 0.80, F1-score = 0.80) than the DL models with Doc2vec and the other models with TF-IDF.

Conclusions : The current results suggest that the vectorization method may have more impact on classification performance than the choice of machine learning model. However, the data set had a number of limitations, including small sample size, imbalance among the categories, and limited generalizability. In this regard, multi-site research with larger samples is needed to improve the machine learning models.
KEYWORD
Text-classification, Electronic medical record, Vectorization, Machine learning, Present illness, Psychiatric diagnosis
Listed journal information
Korea Research Foundation (KCI), KoreaMed, Korean Academy of Medical Sciences member